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Abstract 

We propose a pointwise inference algorithm for high-dimensional linear mod¬ 
els with time-varying coefficients. The method is based on a novel combination 
of the nonparametric kernel smoothing technique and a Lasso bias-corrected ridge 
regression estimator. Due to the non-stationarity feature of the model, dynamic 
bias-variance decomposition of the estimator is obtained. With a bias-correction 
procedure, the local null distribution of the estimator of the time-varying coefficient 
vector is characterized for iid Gaussian and heavy-tailed errors. The limiting null 
distribution is also established for Gaussian process errors, and we show that the 
asymptotic properties differ between short-range and long-range dependent errors. 
Here, p-values are adjusted by a Bonferroni-type correction procedure to control the 
familywise error rate (FWER) in the asymptotic sense at each time point. The finite 
sample size performance of the proposed inference algorithm is illustrated with syn¬ 
thetic data and an application to learn brain connectivity by using the resting-state 
fMRI data for Parkinson’s disease. 


1 Introduction 

We consider the time-varying coefficient models (TVCM) 

y(t) = x(t) T (3(t) + e(t) (1) 

where t G [0,1] is the time index, y(-) the response process, x(-) the px 1 deterministic 
predictor process, /3(-) the p x 1 time varying coefficient vector, and e(-) the mean zero 
stationary error process. The response and predictors are observed at ti = i/n,i = 1, ..., n, 
i.e. yi = y(ti), x,; = x(tj), and e; = e(t*) with a known covariance matrix S e = Cov(e) where 


1 



e = (ei, • • • , e n ) T . TVCM is useful for capturing the dynamic associations in the regression 


models and longitudinal data analysis [16], and it has broad applications in biomedical 


engineering, environmental science, and econometrics. In this paper, we focus on the 
pointwise inference for the time-varying coefficient vector (3{t) in the high-dimensional 
double asymptotic framework min(p, n) —> oo. 

Nonparametric estimation and inference of the TVCM in fixed dimension has been 
extensively studied, see e.g. [27, HE [12], T5j [39i |25, 8], '38, 0Tj. In high dimensions, variable 
selection and estimation of varying-coefficient models using basis expansions have been 
studied in [351 GUI GIJ. Our primary objective is not to estimate (3(t), but rather to 
perform statistical inference on the coefficients. In particular, for any t G (0,1), we wish 
to test the local hypothesis, for j = 1, • • • ,p, 


0 VS H lJt : m + 0. 


( 2 ) 


By assigning p-values at each time point, we construct a sequence of estimators of the 
coefficient vectors that allows us to assess the uncertainty of the dynamic patterns in such 
as brain connectivity networks. Confidence intervals and hypothesis testing problems of 


lower-dimensional functionals of the high-dimensional constant coefficient vector (3(t) 


(3, Vi G [0,1], have been studied in [5, [37, [19] - To the best of our knowledge, little has been 
done for inference of the high-dimensional TVCM and our goal is to fill the inference gap 


between the classical TVCM and the high-dimensional linear model. 


While the existing literature on high-dimensional linear models is based on iicl errors, 
0GZI 119] . we provide an asymptotic theory for answering the question that to which 
extent the statistical validity of inferences based on iid errors can hold for dependent 
errors. Allowing temporal dependence is of the practical interest as many datasets such as 
fMRI data are spatio-temporal and the errors are naturally correlated in the time domain. 
Theoretical analysis has revealed that the temporal dependence has delicate impact on the 
asymptotic rates for estimating the covariance structures, HfflUU. Therefore, it is useful 
to build an inference procedure that is also robust in the time series context. The error 
process e* is modelled as a stationary linear process 



(3) 


m =0 
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where do = 1 and are iid mean-zero random variables (a.k.a. innovations) with variance 
cr 2 . When the are normal, the linear processes of form (j3j) are Gaussian processes that 
cover the autoregressive and moving-average (ARMA) models with iid Gaussian innova¬ 
tions as special cases. For the linear process, we deal with both weak and strong temporal 
dependences. In particular, if a m = 0((m + l) _p ) and g > 1/2, then e; is well-defined 
and has (i) short-range dependence (SRD) if g > 1, (ii) long-range dependence (LRD) if 
1/2 < q < 1. For the SRD processes, it is clear that Ylm =o l a m| < 00 an d therefore the 
long-run variance is finite. 

The paper is organized as follows. In Section [2j we describe our method in details. 
Asymptotic theory is presented in Section [3j Section [4] presents some simulation results 
and Section [5] demonstrates an application to an fMRI dataset. The paper concludes in 
Section [6] with a discussion of some future work. Proofs and some implementation issues 
are available in the Appendix. 


2 Method 


2.1 Notations and Preliminary 

Let K be a non-negative symmetric function with bounded support in [— 1,1], K{x)dx = 
1, and let b n be a bandwidth parameter satisfying b n = o(l) and n~ l = o{b n ). For each 
time point t G w — [b n , 1 — b n \, the Nadaraya-Waston smoothing weight is defined as 


w(i, t ) 


Kb n (tj t ) 
52m= 1 Kb n (tm—t) 

0 


if |ti -t\ <b n 
otherwise 


( 4 ) 


where A/(-) = K(-/b). Let N t = { i : \ti — t\ < b n } be the ^-neighborhood of time t, 
|A*| be the cardinality of the discrete set N t , W t = diag {u>(i,t) i£Nt ) be the \N t \ x \N t \ 
diagonal matrix with G N t on the diagonal, and let !Z t = span(x,; : i G N t ) be the 

subspace in M p spanned by Xj, the rows of design matrix A" in the N t neighborhood. Let 
X t = (w(i,t) 1 / 2 x.i)J eNt , y t = (w(i,t) l / 2 yi)]^ Nt , and S t = (w(i,t) 1 / 2 ei)J eNt . Denote I p as the 
p x p identity matrix. We write the singular value decomposition (SVD) of X t as 


Xt = pdq t 


( 5 ) 
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where P and Q are |iV t | x r, and p x r matrices such that P T P = Q T Q = I r , and 
D = diag(di, • • • , d r ) is a diagonal matrix containing the r nonzero singular values of X t . 
Now let Put be the projection matrix onto IZt, 

Pn t = X? {XtX?)-X t = QQ t , ( 6 ) 


where (X t Xf) = PD 2 P T is the pseudo-inverse matrix of X t Xf. Let 0(f) = P^ t /3(f) be 
the projection of /3(f) onto lZ t such that B(t) = 0(t) — /3(f) is the projection bias. Let 

0(A) = (XjX t + A I p )- l Xfwl l2 ^ t Wl l2 X t (XfX t + A/p) -1 (7) 


be the covariance matrix of the time-varying ridge estimator defined in (J9]), where i = 
Cov((ej)jgAr t ) and A > 0 is the shrinkage parameter of the ridge estimator. Let O m ; n (A) = 
minj< p Djj(X) be the smallest diagonal entry of 0(A). For a generic vector b E M p , we write 
|b| q = (Y7j=i I b j\ q Y /q if q > 0, and |b| 0 = Y7j=i f 0) if q = 0. Let w t = inf i&Nt w(i,t ) 
and w t = sup ieA r w(i,t). For an n x n square symmetric matrix M and an n x rn rectangle 
matrix R, we use Pi(M ) and <Ji(R) to denote the i-th largest eigenvalues of M and singular 
values of R, respectively. If k — rank(P), then crfR) > cr 2 (P) > ••• > cr fc (i?) > 0 = 
cr k+1 (R) = ■■■ = <7 max ( miTl )(i2), zeros being padded to the last ma x(m,n) — k singular 
values. We take p ma , x (M), p min (M) and p m m^o(M) as the maximum, minimum and nonzero 
minimum eigenvalues of M, respectively, and |M|oo = maxi^-^^p \Mj k \. Let 


Pmax(d7, S) 


max 

|b| 0 <s,b^O 


b T Mb 

b T b 


If M is nonnegative definite, then p max (M, s) is the restricted maximum eigenvalues of M 
at most s columns and rows. 

The p-dimensional coefficient vector /3(f) is decomposed into two parts via projecting 
onto the N t |-dimensional linear subspace spanned by the rows of X t and its orthogonal 


complement; see Figure 1(a) A key advantage of this decomposition is that the projected 
part can be conveniently estimated in closed-form, for example, by the ridge estimator since 
it lies in the row space of X t and thus is amenable for the subsequent inferential analysis. 
In the high-dimensional situation, this projection introduces a non-negligible shrinkage 
bias in estimating /3(f) and therefore we may lose information because p \N t \. On the 
other hand, the shrinkage bias can be corrected by a consistent estimator of /3(f). As a 
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(a) Bias correction by projection to the (b) Smoothly time-varying row space 
row space of X t . of X t . 


Figure 1: Intuition of the proposed algorithm in Section 2.2 


particular example, we use the Lasso estimator, though any sparsity-promoting estimator 
attaining the same convergence rate as the Lasso should work. Because of the time-varying 
nature of the nonzero functional /3(f), the smoothness on the row space of X t along the 
time index f is necessary to apply nonparametric smoothing technique; see Fig. |l(b) As 
a special case, when the nonzero components /3(f) = /3 are constant functions and the 
error process is iid Gaussian, our algorithm is the same as that of [5]. Here, we emphasize 
that (i) coefficient vectors are time-varying (i.e. non-constant), (ii) errors are allowed to 
have heavy-tails by assuming milder polynomial moment conditions and to have temporal 
dependence, including both SRD and LRD processes. There are other inferential methods 
for high-dimensional linear models such as [371 Il9j . We do not explore specific choices 
here since our contribution is a general framework of combining nonparametric smoothing 
and bias-correction methods to make inference for high-dimensional TVCM. However, we 
expect that a non-stationary generalization would be feasible for those methods as well. 
Some simulation comparisons are provided for time-varying versions of the bias-correction 
methods in Section [4] 
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2.2 Inference algorithm 


First, we estimate the projection bias B[t) by B{t) = (pR t — I p )/3(t), where /3(f) is the 
time-varying Lasso (tv-Lasso) estimator 

/3(f) = argmin -xjb) 2 + Ai|b|i (8) 

i€N t 

= argmin \y t - X t h\ 2 2 + Ai|b|i. 

beRp 

Next, we estimate 0(t) = Pjz t l3(t ) using the time-varying ridge (tv-ridge) estimator 


0(t) = arg min 

beRp 


53 w ( i ’ t )(Vi ~ *J h ) 2 + ^21b11 

ieNt 


= (x?x t + \ 2 ip)- 1 x^y t . 


(9) 


We defer the discussion of tuning parameters choice Ai and A 2 to Section |3j Our tv-Lasso 
bias-corrected tv-ridge regression estimator for (3{t) is 


m = ~e(t) - Bit). 


( 10 ) 


Based on /3(f) = (/3i(t), • • • ,/3 p (t)) T , we calculate the raw two-sided p-values for individual 
coefficients 


Pj = 2 


\Pj(t) \ - A-i e max ¥i \(Pn t )jk\ 


1 _$ 




jj 


(A 2 


J = (11) 


where £ G [0,1) is user pre-specihed number that depends on the number of nonzero /3(f). 
In particular, if |supp(/3(f))| is bounded, then we can choose £ = 0. Generally, following 
[5), we use £ = 0.05 in our numeric examples to allow the number of nonzero components 
in /3(f) to diverge at proper rates. Let v(f) = (Vi(f), ■ • • , V p (t)) T ~ iV(0, il(A 2 )) and define 
the distribution function 


F(z) = P 


min 2 

j<p 


$ m 


)-l/2 

L ii 


(A 2 )|V)(«)|) 



( 12 ) 


We adjust the Pj for multiplicity by Pj = F(Pj + £), where £ is another pre-defoied small 
number [5j that accommodates asymptotic approximation errors. Our decision rule is 
defined as: reject H 0 j jt if Pj < a for a G (0,1). For iid errors, since £ e = a 2 I n and 


0(A 2 ) = a 2 (X?X t + \ 2 I p )- x XjW t X t {Xj X t + A 2 I p )-\ 
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we see that F(-) is independent of a. Therefore, F(-) can be easily estimated by repeatedly 
sampling from the multivariate Gaussian distribution N(0, G(A 2 )). A similar observation 
has been made in J5|. 


3 Asymptotic results 


In this section, we present the asymptotic theory of the inference algorithm in Section 2.2 


First, we state the main assumptions for iid Gaussian errors. 

1. Error. The errors e, ~ N(0,a 2 ) are independent and identically distributed (iid). 

2. Sparsity. /3(-) is uniformly s-sparse, i.e. sup tg [ 01 j S)*| < s, where S% = {j : f3j(t) ^ 
0} is the support set. 

3. Smoothness. 

(a) /3(-) is twice differentiable with bounded and continuous first and second deriva¬ 
tives in the coordinatewise sense, i.e. (3j (•) G C 2 ([0,1], Co) for each j = 1, • • • ,p 
and Co is an upper bound for the partial derivatives. 

(b) The ^-neighborhood covariance matrix E* = | AA | —1 ^2 ieN Xjx/^ := X° T X° 
satisfies 



o 2 < 


(13) 


4. Non-degeneracy. 


lim inf G min (A) > 0. 


(14) 


5. Identifiability. 


(a) The minimum nonzero eigenvalue condition 

Pmin^o(^t) > £ 0 > 0. 

(b) The restricted eigenvalue condition 


(15) 



(16) 

where = Xj X t is the kernel smoothed covariance matrix of the predictors. 
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6. Kernel. The kernel function K(-) is nonnegative, symmetric around 0 with bounded 
support in [—1,1], 

Here, we comment the assumptions and their implications. Assumption 1 and 6 are stan¬ 
dard. The Gaussian distribution is non-essential and can be relaxed to sub-Gaussian and 


heavier tailed distributions (Theorem 3.4). Assumption 2 is a sparsity condition for the 
nonzero functional components and allows that s —» oo slower than min(p, n). It is a key 
condition for maintaining the low-dimensional structure when the dimension p grows with 
the sample size n. By the argument of Theorem 5 in [40], it implies that the number of the 
first and second non-vanishing derivatives of (3{t) is bounded by s almost surely on [0,1]. 
Assumption 3 ensures the smoothness of the time-varying coefficient vectors and the design 
matrix so that nonparametric smoothing techniques are applicable. Examples of Assump¬ 
tion 3(a) include the quadratic functions (3{t) = (3 + at + ^f 2 /2 and the periodic functions 
(3{t) = (3 + Q:sin(t) + £ cos (t) with |a|oo + |^ < C 0 . Assamption 3(b) can be viewed as 

Lipschitz continuity on the local design matrix that is smoothly evolving, mi. It is weaker 
than the condition that p m ax (S^) < £$ 2 because the latter may grow to infinity much faster 
than the restricted form Jl3| ). Assumption 4 is required for a non-degenerated stochastic 
component of the proposed estimator which is used for the inference purpose. Assumption 
5(a) and 5(b) together impose the identifiability conditions for recovering the coefficient 
vectors. The analogous condition of the time-invariant version has been extensively used 
in literature to derive theoretical properties of the Lasso model; see e.g. 


For the tv-lasso bias-corrected tv-ridge estimator (10), we have the following represen¬ 
tation theorem. 


Theorem 3.1 (Representation). Fix t G w and let 


L t g = max 
j<p 


J2w(i,tyX ri 

j&Nt 


1/2 


= 1 , 2 , 


A 0 = AtxLtpyJlogp, (17) 


and Ai > 2(X 0 + 2CoL tt ib n (s\N t \wt) 1 ^ 2 £ 0 1 ) ■ If A 2 = o(l) ; Assumptions 1-6 hold, and 
C < \N t \w t < \N t \w t < Cfor some C G (0,1), then (3{t) admits the decomposition 


m 

Z (t) 

17/(01 


= (3{t) + z(t) +7(0, 

~ N{ 0,O(A 2 )), 

^ \ 2 \e(t)\ 2 + 2C 0 s 1 ' 2 h n , 4A lS|D 
- Ce 2 + (f 2 1 nt p|o °’ 
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3 = ,P, 


(18) 

(19) 

( 20 ) 









with probability tending to one. If (3j{t) = 0, then we have 


«r/ / 2 (A 2 )( 4 (t) - n(t)) ~ jv(o, i), 


33 


(21) 


where 


i uw ^ \2\e{t )\ 2 + 2C 0 s 1 /' 2 b n 4Ais 
TOW I < - 7^2 -+ —02 




( 22 ) 


with probability tending to one. 


Remark 1. The decomposition (18) can be viewed as a local version of the one proposed 


in [5] (Proposition 2). However, due to the time-varying nature of the nonzero coefficient 


vectors, both the stochastic component z (t) in (19) and the bias component -y(f) in (20) 


differ from [5j. First, our bound (20) for bias has three terms arising from: ridge shrink¬ 
age, non-stationarity and Lasso correction, and each has localized features depending on 
the bandwidth b n of the sliding window and the smoothness parameter C$. Second, the 


stochastic part (19) also has time-dependent features in the covariance matrix (i.e. fI(A 2 ) 


implicitly depends on t though X t ) and the scale of normal random vector is different 
from [5]. Delicate balance among them allows us to perform valid statistical inference such 
as hypothesis testing and confidence interval construction for the coefficients and, more 
broadly, their lower-dimensional linear functionals. 

Example 3.1. Consider the uniform kernel K(x) = 0.5I(|x| < 1) as an important special 
case, the kernel used for our numeric experiments in Section]!} In this case, w t = (2nb n )^ 1 
and \N t \w t = \N t \w t = 1. It is easily verified that under the local null hypothesis 


(22) can be simplified to 


7 j(t) =0 (\ 2 \0(t)\ 2 +s 1/2 b n +Xtsm&xKPTiJjkl 
V k ^3 


From this, it is clear that the three terms correspond to bias of ridge-shrinkage, non- 
stationarity and Lasso-correction. The first and last components have dynamic features 
and the non-stationary bias is controlled by the bandwidth and sparsity parameters. The 


condition C < \N t \w t < \N t \wt < C^ 1 in Theorem 3.1 rules out the case that the kernel 
does not use the boundary rows in the localized window and therefore avoids any jump in 
the time-dependent row subspaces. 
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Remark 2. In Theorem |3.1[ the penalty level for the tv-Lasso estimator Ai can be chosen as 
0(<jL t , 2 y/\ogp + Lt^^bn). The second term in the penalty is due to the non-stationarity 
of /3(f) and the factor s 1//2 arises from the weak coordinatewise smoothness requirement 
on its derivatives (Assumption 3(a)). In the Lasso case with /3(f) = /3 and w(i,t ) = n _1 , 
an ideal order of the penalty level Ai is cm -1 maXj< p (^’! =1 A 7 " 2 ) 1//2 (logp) 1 / 2 see e.g. [;2J. 
In the standardized design case n~ l X 2 - = 1 so that L tt i = 1 and L t:2 = n -1 / 2 , the 
Lasso penalty is 0(a(n~ 1 logp) 1 / 2 ), while the tv-Lasso has an additional term s 1 / 2 6 n that 
may cause a larger bias. In our case, we estimate the time-varying coefficient vectors by 
smoothing the data points in the localized window. Thus, it is unnatural to standardize the 
reweighted local design matrix to have unit I 2 length and the additional bias 0(s 1 ^ 2 b n ) is 
due to non-stationarity. If the Xij are iid Gaussian random variables without standardiza¬ 
tion and we interpret the linear model as conditional on X, then, under the uniform kernel, 
we have L 2 2 = Op(\ogp/\N t \) and, in the Lasso case, the penalty level is 0(a\N t \~ 1 ^ 2 logp). 
If s = O(logp) and b n = O^logp/n) 1 / 3 ), then the choice in Theorem 
order as the Lasso with constant coefficient vector. 


3.1 


has the same 


Based on Theorem 3.1[ we can prove that the inference algorithm in Section [2]2] asymp¬ 
totically controls the familywise error rate (FWER). Let a G (0,1) and FP Q (f) be the 
number of false rejections of H 0j t based on the adjusted p-values. In the asymptotic 
statement, p := p{n ) is a function of n such that p —* oo as n —* oo. 


Theorem 3.2 (Pointwise inference: multiple testing). If the conditions of Theorem 3.1 
hold and 

A2 |^(f) I2 + S 1//2 ^n = o(A2 m i n (A2) 1//2 ), (23) 

then we have for each fixed f€o 


limsupP(FP Q (f) > 0) < a. 


(24) 


The proof of Theorem 3.2 is standard by combining the argument of Theorem 2 in [5] 


and Theorem 3.1 Therefore, we omit the proof. Condition (23) essentially requires that 


the shrinkage and non-stationarity biases of the tv-ridge estimator together are dominated 


by the variance; see also the representation (18), (19), (20), and (21). This is mild condition 


for two reasons. First, in view that variance of the tv-ridge estimator is lower bounded 


when X 2 is small enough; c.f. (14), the first term is quite weak in the sense that the 
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tv-ridge estimator acts on a much smaller subspace with dimension \N t \ than the original 
p-dimensional vector space. Second, for the choice of penalty parameter of Ai in Theorem 
the term s 1 ^ 2 b n in (23) is at most Ai. Hence, the bias correction (including the 


3.1 


projection and non-stationary parts) in the inference algorithm ( [TT| has a dominating 
effect on the second term of (23). Consequently, provided A 2 is small enough, the bias 
correction step in computing the raw p-value asymptotically approximates the stochastic 
component in the tv-ridge estimator. 


Remark 3. The Bonferroni correction (12) for the raw p-values is often conservative and 
thus it may be sub-optimal in power. In our simulation studies, it seems that detection 
power is reasonable while the FWER is controlled at 0.05; c.f. Table [l] and [2] To improve 
the power, one can consider the control of the false discovery proportion (FDP) by the 
principal factor approximation (PFA) method proposed in |14l H3] . By Theorem 3. 1[ under 
the global null hypothesis H o t : j3i(t) — ■■■ — f3 p (t) = 0, we have /3(f) — 7 (f) ~ iV(0, D(A 2 )) 
with a known covariance matrix ff(A 2 ). Therefore, our test statistic is jointly normal and 
the PFA can be applied to control the FDP if D(A 2 ) can be well approximated by the 
covariance matrix of a factor model plus a weakly dependent component. [30J provided a 
practical procedure to estimate the FDP. 

Next, we relax the iicl assumption on the errors to allow temporal dependence. 

Theorem 3.3 (Gaussian process errors). Suppose that the error process e* is a mean- 
zero stationary Gaussian process of form |5j) such that \a m \ < K(m + 1) _<? for some 
g G (1/2,1) U (1, 00 ) and finite constant K > 0. Under Assumptions 2-6 and the notation 
of Theorem \3.1\ with 

4aL ti 2 |a| lv /logp ifg> 1 

C etK ^L ti2 n 1 ~ e y/logp if 1 > q > 1/2 


An — 


(25) 


where a = (ao, ap, • • ■ ) T , we have the representation of /3(f) in (18)-[22) with probability 
tending to one. 


From Theorem 3.3, the temporal dependence strength has a dichotomous effect on the 
choice of Ao, and therefore on the asymptotic properties of /3(f). For e* with SRD, we have 
|a|i < 00 and Ao x aL ti2 -y/logp. Therefore, the bias-correction part 7 (f) of estimating 
/3(f) has the same rate of convergence as the iid error case. The temporal effect only plays 
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a role in the long-run covariance matrix of the stochastic part z (t). If e* has LRD, then 
the temporal dependence has impact on both 7 (f) and z(£). In addition, the choice of the 
bandwidth parameter b n is different from the SRD and iid cases. In particular, the optimal 
bandwidth for g G (1/2,1) is (^((logp/n 5 ) 1 / 3 ) which is much larger than O^logp/n) 1 / 3 ) 
in the iid and SRD cases, assuming s is bounded. The boundary case g — 1 can also be 
characterized; details are omitted. 

We also relax the moment condition on the errors that, in the iid error case, are assumed 
to be zero-mean Gaussian. First, it is easy to relax this assumption to distributions with 


sub-Gaussian tails (see Definition 7.1 in the Supplementary Material) and Theorem 3.1 


and T2 continue to hold, in view that the large deviation inequality and the Gaussian 
approximation for a weighted partial sum of the error process only depend on the tail 
behavior and therefore on moments of e t . Second and more importantly, the sub-Gaussian 
assumption may be knocked down to allow iid noise processes with algebraic tails, or 
equivalently e* with moments up to a finite order. The consequence of this relaxation 
is that a larger penalty parameter for the tv-Lasso is needed for errors with polynomial 
moments. Let S be the square root matrix of Ul(\ 2 )/a 2 (i.e. f 2 (A 2 ) = cr 2 SS T ) and £ • be 
the j-tli row of S. 


Theorem 3.4 (Heavy-tailed errors). Under the conditions of Theorem 3.1 with E|e / 9 < 
00 ,q > 2 , choose 

A 0 = < 7 , max {(pfi ntq ) 1/q , crL ti2 (log p) 1/2 } , for large enough C q > 0 , (26) 


where /x „ i9 = \™{ht) x ij\ q ■ J f \€j\q = °(l£j| 2 ) for all j = 1 , ■ ■ ■ ,p, then holds 

with probability tending to one and Theorem 3.2 \ holds. 

The assumption |£ -| 9 = 0 (^/ 2 ) is needed to ensure the asymptotic validity of the 
Gaussian approximation of the ridge component (19) for non-Gaussian data. 


4 Simulation studies 

4.1 Simulation setup 

In the simulation studies, we generated the n x p design matrix with iid rows sampled 
from N( 0 , Ex) for n = 300 and p = 300. We considered two covariance structures on the 
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design matrix: (i) E x = I p ] (ii) S x = T, where T = ( tj k ) p - k=l and tjk = 0.5b’ _fc L The 
time-varying coefficient vectors (3{t) had s = 3 non-zero elements and p — 3 zeros for all 
t E [0,1]. The non-zero elements in (3(t) were generated by sampling nodes from a uniform 
distribution C/(—2.5, 2.5) at regular time points and smoothly interpolating on the interval 
[0,1] using the cubic splines. We simulated the following stationary error processes. 

1. The e* are iicl N{ 0,1). 

2. e* = + & is an AR(1) process where ip E {0.2, 0.5} and £, are iid N(0, 1). 

3. The e* are iid Student’s t( 3)/\/3. 

4. 6i is a long-memory process e* = Ylm= o( m + where g = 0.75 and are iid 

Gaussian with mean zero. 

We compared the performance of the proposed method with the following. 

1. (TV-Lasso) - The time-varying Lasso, the kernel smoothed time-varying LASSO 
defined in (J8]) , where Ai is selected by the cross-validation (CV). 

2. (FP-Lasso) - The false-positive Lasso, where Ai is tuned to match the FWER of the 
proposed method. This allowed us to compare the power at similar levels of FWER. 

3. (TV-LDPE) - An adaptation of the de-biased LASSO inference procedure by [37] to 
the kernel smoothed, time-varying setting. 

4. (TV-SDL) - An adaptation of the SDL test of [Ki] to the kernel smoothed, time- 
varying setting. 

5. (Non-TV) - The original non-time-varying method of [5] that ignores the dynamic 
structures. The penalty parameter Ai in the Lasso is set to the scaled Lasso parameter 
V 2 l °s(p)/n. 

In all time-varying models, we used the kernel bandwidth b n = 0.1. For the proposed 
method, we used A 2 = 1 jn and ( = 0. We let Pj,t,m be the multiplicity-adjusted p-value 
for testing the hypothesis Ho.j.t '■ Pj{t) = 0 for t E w = [b n , 1 — b n \ in the m-th Monte 
Carlo simulation for m — 1, • • • , M. For TV-LDPE, TV-SDL, the proposed method and 
its non-tv version, we adopted the following performance measures. 
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1. The (averaged) false positive rate (FPR) over the interval w , 


1 


M 


n( 1 — 2 b n )(p — s)M 


y y y —“)• 


tGtzt j£S c m= 1 


2. The (averaged) false negative rate (FNR) over the interval w, 


1 


M 


n( 1 — 2 b n )sM 



j£S m =1 


3. The (averaged) FWER over the interval xzr, 


FWER 



For the Lasso-based methods (TV-Lasso and FP-Lasso), the probabilities are replaced 
by the corresponding indicators of whether or not the estimated coefficients are zero. 

4.2 Empirical results 

For each simulation setup, we report the FPR, RNR, FWER, and the root mean square 


errors (RMSE) of the estimates. The results are shown in Tables [T] and [2j from which 


we make several observations. First, TV-Lasso and the method of [5] do not control the 
FWER, while the proposed method can control the FWER at the nominal level a = 0.05 
in all setups. Second, the proposed method has uniformly higher power than FP-Lasso, 
the TV-Lasso tuned to match the FWER with our method. This is probably explained 
by the bias of the i\ regularization in the TV-Lasso. Third, for the design matrix with iid 
Gaussian entries, TV-LDPE and TV-SDL have comparable performance as our proposed 
method in terms of the power, while all three methods have the FWER controlled below 
0.05. TV-LDPE and TV-SDL are more sensitive to the design matrix than the proposed 
method; for the Toeplitz design matrix T, TV-LDPE and TV-SDL seem to lose control on 
the FWER. Moreover, the FWER, FNR, and RMSE are larger as the dependence level of 
the error process grows and as the tail of the error distribution becomes thicker. Finally, 
the proposed method is computationally more economical than the competing methods 
TV-LDPE and TV-SDL. Table [3] shows the runtimes on an Intel i5-4790K with the Intel 
MKL linear algebra libraries (software and platform: R 3.2.2 for Windows). 
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Table 1: Simulation results, n = 300, p = 300, s = 3. 



iid 

N(0, I p ), e 

~ JV(0,/ n ) 


iid 

N(0,T), e 

~ N(0,I n ) 


Method 

FPR 

FNR 

FWER 

RMSE 

FPR 

FNR 

FWER 

RMSE 

TV-Lasso 

7.51 x 10 -2 

0.0551 

1 

0.0537 

1.55 x lO” 1 

0.0684 

1 

0.0520 

FP-Lasso 

1.50 x 10“ 4 

0.2352 

0.0344 

0.1124 

2.81 x 10“ 5 

0.5170 

0.0063 

0.1318 

Proposed 

1.50 x 10“ 4 

0.1889 

0.0346 

0.1838 

2.81 x 10“ 5 

0.4072 

0.0063 

0.1984 

TV-LDPE 

1.29 x 10“ 4 

0.1981 

0.0254 

0.2652 

3.77 x 10" 4 

0.3615 

0.0743 

0.2727 

TV-SDL 

1.53 x 10 -4 

0.1848 

0.0357 

0.1316 

1.08 x 10 -3 

0.3006 

0.2119 

0.1409 

Non-TV 

4.89 x 10 _1 

0.5100 

0.4600 

0.5762 

2.58 x 10 _1 

0.7100 

0.2400 

0.6084 


Xi AT(0, /p), e ~ AR(1) with <p 

= 0.2 

Xi N(0,T), e ~ AR(1) with p> 

= 0.2 

Method 

FPR 

FNR 

FWER 

RMSE 

FPR 

FNR 

FWER 

RMSE 

TV-Lasso 

8.93 x 10" 2 

0.0563 

1 

0.0533 

1.51 x 10 _1 

0.0629 

1 

0.0519 

FP-Lasso 

1.78 x 10“ 4 

0.2363 

0.0384 

0.1099 

7.21 x 10“ 5 

0.4709 

0.0173 

0.1302 

Proposed 

1.78 x 10 -4 

0.1891 

0.0376 

0.1836 

7.21 x 10 -5 

0.3434 

0.0173 

0.1920 

TV-LDPE 

9.85 x 10" 5 

0.1995 

0.0215 

0.2799 

3.70 x 10" 4 

0.3620 

0.0843 

0.2725 

TV-SDL 

1.97 x 10“ 4 

0.1883 

0.0419 

0.1374 

1.07 x 10“ 3 

0.3032 

0.2165 

0.1404 

Non-TV 

4.29 x 10 _1 

0.5633 

0.4100 

0.5557 

2.58 x 10 _1 

0.6933 

0.2200 

0.6088 


Xi AT(0, ip), e ~ AR(1) with ip 

= 0.5 

Xi V(0,T), e ~ AR(1) with p> 

= 0.5 

Method 

FPR 

FNR 

FWER 

RMSE 

FPR 

FNR 

FWER 

RMSE 

TV-Lasso 

7.55 x 10" 2 

0.0544 

1 

0.0537 

1.51 x 10 _1 

0.0611 

1 

0.0518 

FP-Lasso 

1.93 x 10 -4 

0.2402 

0.0431 

0.1124 

9.81 x 10 -5 

0.4805 

0.0222 

0.1303 

Proposed 

1.93 x 10“ 4 

0.1809 

0.0422 

0.1836 

9.81 x 10“ 5 

0.3347 

0.0235 

0.1918 

TV-LDPE 

9.82 x 10“ 5 

0.2028 

0.0229 

0.2756 

3.71 x 10“ 4 

0.3618 

0.0849 

0.2659 

TV-SDL 

1.94 x 10“ 4 

0.1898 

0.0425 

0.1374 

1.01 x 10“ 3 

0.2993 

0.2555 

0.1439 

Non-TV 

5.49 x IQ” 1 

0.4500 

0.5200 

0.5314 

1.70 x 10 _1 

0.8300 

0.1500 

0.6004 


5 Data example: learning brain connectivity 

We illustrate our proposed method in an application to model the functional brain con¬ 
nectivity in a Parkinson’s disease study. The problem is to construct brain connectivity 
networks from the resting-state functional magnetic resonance imaging (fMRl) data, where 
slowly time-varying graphs have implications in modeling brain connectivity networks. 
Traditional correlation analysis of the resting state blood-oxygen-level-dependent (BOLD) 
signals of the brain showed considerable temporal variation on small time scales, PEE]. 
In view of the high spatial resolution of LVIRI data, brain networks of subjects at rest are 
believed to be structurally homogeneous with subtle fluctuations in some, but a small num¬ 
ber of, connectivity edges, [201 IT]. A popular approach to learn brain connectivity is the 
neighborhood selection procedure, [23]. Therefore, high-dimensional TVCM with a small 
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Table 2: Simulation results (continued), n = 300 ,p = 300, s = 3. 


Xf ~ JV(0, Ip), &i ~ t( 3 )/y /3 Xi ~ N(0, T), e; ~ t{ 3)/%/3 


Method 

FPR 

FNR 

FWER 

RMSE 

FPR 

FNR 

FWER 

RMSE 

TV-Lasso 

9.14 x 10~ 2 

0.0518 

1 

0.0539 

1.57 x 10 _1 

0.0605 

1 

0.0506 

FP-Lasso 

1.96 x 10“ 4 

0.2193 

0.0398 

0.1120 

4.20 x 10“ 5 

0.4547 

0.0124 

0.1209 

Proposed 

1.96 x 10 -4 

0.1708 

0.0439 

0.1834 

4.20 x 10 -5 

0.3129 

0.0125 

0.1885 

TV-LDPE 

1.26 x 10~ 4 

0.2041 

0.0275 

0.2659 

3.62 x 10“ 4 

0.3043 

0.0810 

0.2719 

TV-SDL 

1.90 x 10“ 4 

0.1903 

0.0403 

0.1321 

9.54 x 10“ 4 

0.2814 

0.1960 

0.1381 

Non-TV 

5.16 x 10 _1 

0.4800 

0.4800 

0.5289 

2.24 x 10 _1 

0.7700 

0.1500 

0.5962 


X* 7V(0, Ip), e LRD with q 

= 0.75 

Xi ~ N{ 0, T ), e ~ LRD with q 

= 0.75 

Method 

FP(%) 

FN(%) 

FWER 

RMSE 

FPR 

FNR 

FWER 

RMSE 

TV-Lasso 

7.25 x 10 -2 

0.0652 

1 

0.0499 

1.77 x 10 _1 

0.0805 

1 

0.0556 

FP-Lasso 

1.60 x 10“ 4 

0.2496 

0.0450 

0.1091 

1.37 x 10“ 4 

0.5138 

0.0229 

0.1381 

Proposed 

1.60 x 10“ 4 

0.1783 

0.0433 

0.1806 

1.37 x 10 -4 

0.3376 

0.0243 

0.1953 

TV-LDPE 

1.08 x 10“ 4 

0.2067 

0.0238 

0.2653 

3.68 x 10 -4 

0.3648 

0.0859 

0.2691 

TV-SDL 

2.10 x 10 -4 

0.1924 

0.0501 

0.1386 

9.29 x 10 -4 

0.3014 

0.0186 

0.1448 

Non-TV 

5.19 x 10 _1 

0.4800 

0.4800 

0.5304 

5.49 x 10 _1 

0.4500 

0.4100 

0.5280 


Table 3: Runtime per 10 simulations. 


Method 

Runtime (in minutes) 

TV-Lasso 

0.5 

FP-Lasso 

9.5 

Proposed 

13 

TV-LDPE 

> 1000 

TV-SDL 

> 1000 

Non-TV 

< 0.5 


number of nonzero components is a natural approach to study the time-evolving sparse 
brain connectivity networks, in which the time-varying coefficients reflect the dynamic fea¬ 
tures of the corresponding edges in the networks. The neighborhood selection approach 
is an approximation to the full multivariate distributions while ignoring the correlation 
among the node-wise responses and this may cause certain power loss in finite samples. 
[23] showed that in terms of variable selection, these two approaches are asymptotically 
equivalent. 

Our data example uses fMRI data collected from a study of patients with Parkinson’s 
disease (PD) and their respective normal controls. PD is typically characterized by devi¬ 
ations in functional connectivity between various regions of the brain. Additionally, the 
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resting state functional connectivity has been shown as a candidate biomarker for PD 
progression and treatment, where more advanced stages or manifestations of PD are asso¬ 
ciated with greater deviations from normal connectivity. Each resting state data matrix 
in our example contains 240 time points and 52 brain regions of interest (ROI). The time 
points are evenly sampled and the time indices are normalized to [0,1]. 

The brain connectivity network was constructed using the neighborhood selection pro¬ 
cedure. In essence, it is a sequence of time-varying linear regressions enumerating each 
ROI as the response variable and sparsely regressing on all the other ROIs. Figure [2] and [3] 
show the estimated graphs of a normal subject and a PD subject at three sequential time 
points around t = 0.25 based on the proposed method. Red nodes are ROIs known to be 
associated with motor control and blue nodes are ROIs either known to be unrelated to 
motor control or whose functions in humans are not well understood. Different patterns of 
connectivity in the networks can be found by comparing normal and PD subjects. From 
the graphs, there are slow changes in the networks over time: most edges are preserved 
on a small time scale, but there are a few edges evolving over time. For instance, in a PD 
subject, ROI 1 and ROI 40 are unconnected in the first network but they are connected 
in the second network and remain connected in the third network. 

We also plot the connectivity graphs generated by the competing methods TV-LDPE 
and TV-SDL; see Figure [4] and [5} These graphs are denser than with the proposed method 
and are harder to interpret. Besides, as commented in the simulation studies, TV-LDPE 
and TV-SDL are more computationally expensive. The method of [5] cannot be used to 
capture the dynamic features of brain connectivity networks. 

6 Discussions 

This paper presents a pointwise inference algorithm for high-dimensional TVCM that can 
asymptotically control the FWER. Based on the current work, an interesting improvement 
would be to study simultaneous inference. Construction of the simultaneous confidence 
band (SCB) for the time-varying coefficients is useful for testing their parametric forms in 
high dimensions. This is a more challenging topic, which requires substantial additional 
work and probability tools that are beyond the scope of this paper. 

Our brain connectivity application is a subject-by-subject analysis. To perform the 
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Figure 2: Connectivity networks in control subject around t = 0.25 based on the proposed 
method. 



Figure 3: Connectivity networks in Parkinson’s Disease subject around t = 0.25 based on 
the proposed method. 



group analysis on the population level, a hierarchical linear model is more appropriate. 
When p is fixed, the linear mixed-effects model is widely used in performing the multi¬ 
level group analysis in fMRI, HH2HE3. The reason is that the generalized least squares 
(GLS) estimator of a two-level model is inferentially equivalent to the GLS estimator of 
the corresponding single-level model, provided that the second-level covariance is the sum 
of the group covariance and the covariance of the first-level estimate, [IJ. Extension of 
the testing problem on the population parameters based on the ridge and Lasso estimates 
when p —» oo is another interesting future research topic. 
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Figure 4: A connectivity network based on TV-LDPE. 



Figure 5: A connectivity network based on TV-SDL. 



7 Proof 


Proof of Theorem \3.1\ Observe that X t (3(t) = X t Oft) since Oft) = Pjz t f3(t). Using the 
closed-form formulae for the tv-ridge estimator 0 and by ([36]) , we have 


bias(0(f)) = E (Oft)) - Oft) = (X t ' + A 2 / p )- 1 A t T [X t Oft) + M t X t (3'ft) + X t £] - 0(t), (27) 


where |^|oo < C\)bf/2 and |£| 0 < s almost surely, t E w. First, we bound the shrinkage 
bias of the tv-ridge estimator. By the argument in Section 3 of [25] . we can show that 

{XjX t + A 2 I p )~ 1 X i fX t 0ft) - Oft ) = -Q{\f l D 2 + I r )- l Q T 0(t). 
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( 28 ) 


It follows from Lemma [7.21 that 
\Q{X^ 1 D 2 + I r )~ 1 Q T G{t)\ 2 < 




Pmin(-^2 D 2 + I r ) 

X 2 

\ 2 + miiij< r d 2 


/ x 2 \e(t )\ 2 ^\ 2 \e(t )\ 2 
1 ( )|2 5 WAS) - 


2 ’ 


where d 2 = pj{Y> t )ij = 1, • ■ ■ , r. Next, we deal with the non-stationary bias of the tv- 
ridge estimator (27) by a similar argument for (28). Indeed, let Q± be the orthogonal 
complement of Q such that Q\Q± = I p - r and Qj_Q = 0( p - r )xr ■ Denote T = [Q]Q±). 
Then, rr T = r T r = I p . By the SVD of X t , g, we have 


= [«;0 J 
= [Q\Q‘ 


Q 

Q 


T 


T 


(QD 2 Q t + \ 2 I p )[Q;Qa] 


X 2 I p 

)r y 1 

) l 

1 

h 

Q> 

1_ 

) 

1 

1—1 
Q> 
_1 


QDP T M t X t (3\t) 


0 


(D 2 + X .Ir)- 1 

0 

= Q(D + X 2 D- 1 )- 1 P r M t X t (3\t). 


\-i T 
/x 2 1 p—r 


DP T M t X t {3'(t) 

0 


Hence, by Lemma 7.2 we have 


\{XjX t + X 2 I p )~ 1 Xt M t X t (3'(t)\ 2 < 


bn wm 


< 


Cob n (yS\N t \wt)^^ 2 Eq 1 


Pmin(D + A 2 D- 1 ) min i < r (d j + A 2 /d j ) ’ 
where w t = sup ieNt w(i,t). Since A 2 = o(l) and dj > the denominator of 

last expression is lower bounded by [(| | 1 ^ 2 -^o + A 2 / / ((|-/Vt|;u; t ) 1 / 2 £o)] for large enough n. 

Therefore, we have 


\{XjX t + X.I^XjM t X t (3\t)\ 2 < 


CoMs|'V 1 |S ><) 1/2 „ G)M 1/2 


< 


(\N t \w t yiH? - 


04 


(29) 


Similarly, an upper bound for the remainder term of (27) can be established. We have 

Cobls 1/2 


\(X t 'X t + \ 2 I p )- 1 X t 'M t X t £\ 2 < 


2Ce 2 ’ 


for almost surely t G w. 


(30) 


In addition, 6{t) — E[0(i)] = {XjX t + A 2 I p )~ l XjE t is the stochastic part of the tv-ridge 
estimator. Since the e* ~ N(0,a 2 I n ) are iicl, £ t ~ N(0,a 2 W t ). Hence, 0{t) — E[0(£)] ~ 
IV(0, H(A 2 )), where H(A) is dehned in g , and thus 

Var(^(f)) = %(A 2 ) > H min (A 2 ). (31) 
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Now, we consider the initial tv-lasso estimator. By Lemma 7.3 


1/3(*) -/3(/)li < 4 <t>„ A,s 


( 32 ) 


Then, (18), (19), and (20) follow by assembling (28), (29), (30), and (32) into (10), 


m = /3(f) + bias(fl(f)) + {9(f) - E[0(t)]} - {(P Kl - I p )fm - /3(f)]}. 


The marginal representation (]2l|) and (]22|) follow from similar arguments by noting that 

□ 


B j(t) = Y.k&( p nt)jkPk{t) under 


Proof of Theorem 3.3. The proof of Theorem 3.3 is similar to that of Theorem 3.1 


so we 


only highlight the difference involving the error process. First, Cov(S t ) = W)^ 2 Yj e jW)f 2 . 


Second, instead of using (|37|) in proving Lemma 7.3[ we use Lemma 7.4 to get for all A > 0 

A 2 


P | max 

j<p 


w(i, t)X, 

i€N t 


ij&i 


> A < 2pexp 


2LIMW 


P I max 
j<p 


Yw(i,t)X , 


ij G 


i&N t 


> A ] < 2pexp 


C,A 2 


L 2 t2 n 2 P~e)a 2 K 2 


if q > 1 , 


if 0 6 (1/2,1). 


□ 


Proof of Theorem 3.f. The proof essentially follows the lines of that of Theorem 3.1[ but 
with differences in requiring a larger penalty parameter Ai of the tv-Lasso. First, by the 
Nagaev inequality m we have for any e > 0, 


P I max 

3<P 


Y W (p t ) X -ij e i 

ieNt 


> vL t - 2 e < (1 + 2/g)% 


PT 


n,q 


+ 2pexp ( -CgE 2 ) , 


q (aL tj2 £) q 

where c q = 2 e~ q (q + 2)~ 2 and K q is the g-th absolute moment of e 1 . Then, choosing 


e = C q max { (PL t " n ’ q ) — ) (logp) 1 / 2 ! for large enough C q > 0, 

l aL t,2 J 

we have maxj< p | w(i,t)Xijei\ = Op(A 0 ). Second, let S = {Xj X t + X 2 I p )~ 1 X i v W p2 

and Sf = {ef)J eNt . Recall that 0(t) — E[0(£)] = S £f. By the Gaussian approximation 
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[22, Theorem B], there exist iid Gaussian random variables g t ~ Y(0,cr 2 £ 2 j) defined on a 
possibly richer probability space such that for every t > 0 



Thus, it follows that 6j(t) — E[0,-(£)] = iV(0, Oy,(A 2 )) + Op(|^j| 9 ). As Qjj( A 2 ) = cr 2 |£ ? -||, the 
proof is complete for by assumption, |£-| 3 = o(|^| 2 ). □ 


□ 


Supplementary Material 

The supplementary material contains additional technical lemmas and discusses some 
implementation issues. 
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Supplementary Material 


Additional technical lemmas 


Lemma 7.1. Let X be an n x p matrix and D = diag(d\, ■ ■ ■ , d n ) with \di\ < b and b > 0. 
Then 

Pmax(X T DX, s) < 2 bp max (X T X,s). 

If di G [0,6], then p max (X T DX,s) < bp max (X T X, s). 

Proof. Let A s = {a e M p : |a| 2 < 1, |a| 0 < s}. Write di = df — d~, where df = max(rfj, 0) 
and d~ = max(—rfj, 0) are the positive and negative parts, respectively. By definition 


P ma x(A 1 DX, s ) = max [a 1 X 1 DXa\ = max |tr(D(A"aa T A" T ^ 

ae^ls ae^ls 


= max 

aS^l s 


- d~)(Xaa T X T ) 


i —1 


< '2b max V ( A"aa A" ) 
_ a eAr ^ 


i—1 


= 26maxtr(Xaa X ) = 26maxa X Xa = 26p max (X X, 5 ), 

atzAs aGvAs 
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because X T X is nonnegative definite. The second claim follows from the same lines with 


(L = 0 . 


□ 


Lemma 7.2. Let t 6 w and E t be the kernel smoothed sample covariance at time t and 
E^ = Xf T Xf. Suppose that Xf has full row rank. Assume further (15), (13) and assump¬ 
tion 6 hold, then we have 


Pmax (E t ,s) < \N t \w t e 0 2 . 


(33) 

(34) 


Proof. Since Xf is of full row rank, r = \N t \. Note that X t = (\N t \W t ) 1 / 2 Xff, pj(E t ) = 
a({X t ) and pi( E£) = af(Xf). By the generalized Marshall-Olkin inequality, see e.g. 


Theorem 4], assumption 6 and (15), we have 


Pmin^o(St) = Pmin(X t X^) = \N t \p min (W( /2 XfX; T W( /2 ) 


= \N t \p min (XfXf T W t ) > \N t \p min (W t ) p min (XfXf 1 ) > \N t \w t el 


O VO"T A 


The second inequality (34) follows from assumption 3(b) and Lemma 7.1 applying to 
E t = \N t \X? r W t X? and W t > 0. □ 

Lemma 7.3. Suppose assumption 1, 2, 3 and 5(a) hold. Let t e w be fixed and Aq be 


defined in (11). Then, for Ai > 2(A 0 + 2C 0 L t is 1 ^ 2 £ 0 1 b n \N t \w t ) where A 0 is defined in (11), 
we have, with probability 1 — 2p -1 , 


I Xt\J3(t) -f3(t)}\l + ^\m -P(t)h <4A?-^. 


(35) 


Proof. By definition 

13-4 — Xt(3(t) I 2 + Ai|/3(t)|i < \y t — X t (3(t) I 2 + Ai|/3(t)|i, 

which implies that 

\rnm - p{t)] i!+ \ 1 \m \1 < mpirn +2 (y t ~ x t m, mm - Pit)] >. 


By assumption 2 and Taylor’s expansion in the ^-neighborhood of t, we see that 


y t - XtPit) = £ t + M^it ) + x t i, 


(36) 
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where M t = diag((tj — t) ieNt ) and £ is a vector such that |£|oo < 2 and |£| 0 < s. 

Let J = {2\£jX t \ 00 < A 0 }. Observe that \£jX t \oc = maxj< p \ ^2 ieNt w(i, t)Xij e i\ and, by- 
assumption 1, 


Y w ih t)Xijei ~ N | 0, a 2 Y 

i£N t \ i£N t 


w(i,t) 2 x? 


(37) 


Then, by the standard Gaussian tail bound and the union bound, we obtain that 


fr 


P max 
\ j<p 


SieTVt w (h 


<jL 


t, 2 


> \Je 1 + 2 logp J < P(max|Zj| > \J e 2 + 2 logp) < 2 exp [—%- 


j<p 


for all e > 0, where Zj ~ 1V(0,1). Now, choose e = (21ogp) 1 / 2 and A 0 = 4o-L i)2 (logp) 1//2 , 
we have P(77) > 1 — 2p _1 . Further, we have 

\0\ty Xj- f}(t)}\ < \0(t)~ m\i\*7M t X t 0(t)U 

< | j3(t) — /j(7i i max f N ) w(i. t)X^ ) fi' it)~ it) |' " (Cauchy-Schwarz) 

^~ p \ie.Nt J 


< \$(t) - /3(i)| 1 Lt,iy Pmax(^ T M?X t ,s)\f3\t)\ 2 (assumption 2) 

< 1 P(t) - /3(t)|iL t yC' 0 s 1/2 6„^/p max (A’7T’ 4 ,s) (Lemma 

< |3(t) - /9(i)|i-£'i I iC , o(|7V t |iy t s) 1/2 6 n eo 1 (Lemma 


7.1, assumption 2 and 3) 


7.2, equation (34)). 


Similarly, we can show that \£ T X^X t [/3(f) — (3(t)\\ = O(Lt i i(|A^|'ir;ts) 1 / 2 6 2 £ 0 1 \(3(t) — /3(f)|i) 
Therefore, it follows that, with probability at least (1 — 2p -1 ), 

y t -Xtf3(t),Xt[m-m) | < [Ao + 2L i , 1 C 0 (|iV t |w tS ) 1 / 2 fe n£ - 1 (l + 0 (l))] |/3(t)-/3(t)| 

Now, choose Ai > 2(A 0 + 2L t ^C 0 (\N t \w t s) 1 ^ 2 b n £Q 1 ), we get 

2| X t [P(t) - /3(t)]\l + 2X 1 \P(t)\ 1 < A Mt) - /3(f)|i + 2A 1 |/3(f)| 1 . 

Denote S 0 := *S' 0 (t) = supp(/3(t)). By the same argument as |3, Lemma 6.3], it is easy to 
see that, on 77, 

2\X t [P(t) -m]\l + Arl^gWI, < Mi\Pso(t)-Pso(t)\i- 


i- 


But then, (35) follows from the restricted eigenvalue condition (assumption 5) with the 
elementary inequality 4 ab < a 2 + 45 2 that 

2\Xt[(3(t) — P(t)]\l + \t\Pit) —(3(t)\i < 4:\i\(3 So (t) — (3 So (t)\i < \X t [P(t)~P(t)}\l + 4X 

□ 
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Definition 7.1. A mean zero random variable is said to be sub-Gaussian with variance 
factor a 2 if 

logE(e A ' Y ) < A 2 a 2 /2 for all A G M. 

Lemma 7.4. Let be iid sub-Gaussian random variables with mean zero and variance 
factor a 2 , and e* = Ylm=o a m£i-m be a linear process. Let w = (uq, • • • , w n ) be a real 
vector and S n = Y^i=i w i e i be the weighted partial sum of e t . 

1. (Short-range dependence). If |a|i = l a *l < 00 > then f or all x > 0 we have 

p (|S„|>n< 2 exp(— 2|w| ^ |?g2 ). (38) 


2. (Long-range dependence). Suppose K = sup m>0 |a m |(m + l) e < oo ; where 1/2 < g < 
1. Then, there exists a constant C e that only depends on g such that 


IP(| S n | > x) < 2 exp 


C e x 2 \ 
'w\\n 2( d^ e la 2 K 2 ) 


(39) 


Proof. Put a m — 0 if m < 0 and we may write S„ = X] mez b m f m . where b m = V:'_, uyaj_ m . 
By the Cauchy-Schwarz inequality, 






tuETj 


mGZ \ 2=1 


E 

v 2=1 




^ I 121 12 

< W o ah. 


2— m \ I _ | vv \ 2\ cx \ l ' 


Then, (38) follows from the Cramer-Chernoff bound [3j. Let a m = rnax/> m \ai\ and A m = 
J2'i=o | a i\- Note that A n < K^2^ =0 (l + l) _e < C e K(n + l) 1_e , where C g = (1 — g)~ l . Then, 
we have 

n n / n 

E b ™< E 

m=l — 77 772=1 — 77 \ 2=1 

If m < —n, then |6 m | < |w|iai_ m and therefore 


E \ a i~m\ < \™\l A 


2 \ 2 
2t7* 


y 2=1 


E ^ Mi E °i-m < C Q n\w\ 2 2 K 2 n l 2q , 

772 <—77 772 <—77 

where the last inequality follows from Karamata’s theorem; see e.g. [26] . Hence, the proof is 
complete by invoking the Cramer-Chernoff bound for sub-Gaussian random variables. □ 
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Some implementation issues 


We assumed that the noise variance-covariance matrix E e is known. In the iicl error case 
E e = cr 2 / n , we have seen that the distribution F(-) is independent of a 2 and therefore 
its value does not affect the inference procedure. The noise variance only impacts the 
tuning parameter of the initial Lasso estimator. In practice, we can use the scaled Lasso 
to estimate cr 2 in our numeric studies. Given that |d/cr — 1| = op(l) [32], the theoretical 


properties of our estimator (10) remains the same if we plug in the scaled Lasso variance 
output to our method. For temporally dependent stationary error process, estimation of E e 
becomes more subtle since it involves n autocovariance parameters. We propose a heuristic 
strategy: first, run the tv-Lasso estimator and get the residuals; then calculate the sample 
autocovariance matrix and apply a banding or tapering operation B k { E) = {aj k l(\j — k\ < 
^)}j,fc=i 02 H 22J. 

We provide some justification on the heuristic strategy for SRD time series models. To 
simplify explanation, we consider the uniform kernel and the bandwidth b n = 1. Suppose 
we have an oracle where /3(f) is known and we have access to the error process e(t). Let 
E* be the oracle sample covariance matrix of e* with the Toeplitz structure i.e. the h -th 
subdiagonal of E* is a* h = n~ l YH‘=] e i e i+h- We first compare the oracle estimator and 
the true error covariance matrix E P . Let a > 0 and define 


T{a,C u C 2 ) = \Me ST pxp : ]T \m k \ < C x hr a , Pj {M) e [C 2 ,C^], Vj = 1, 


,P 


k=h -\-1 


where ST pxp is the set of all p x p symmetric Toeplitz matrices. If e* has SRD, then 


E e G T{q — 1, Ci, C 2 ). By the argument in [3] and Lemma 7.4, we can show that 


Pmax(-Bft(Eg) — E e ) < p max (-B/i(S*) — Bh( S e )) + p max (-E>/i(E e ) — E e ) 

V n 

Choosing h* x (n/logn) 1 ^ 2 ^, we get 

PraUMK) ~ Se) = 0 P ^ ^ . 

This oracle rate is sharper than the one established in [3J for regularizing more general 
bandable matrices if n — o(p). Here, the improved rate is due to the Toeplitz structure 
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in E e . Since E e has uniformly bounded eigenvalues from zero and infinity, the banded 
oracle estimator Bh (£*) can be used as a benchmark to assess the tv-Lasso residuals 

s t = y t -x t p(t). 


Proposition 7.5. Suppose E e 6 T(g — 1, C±, Cf) and, conditions of Lemma \7.c\ are satisfied 
except that (ef) is an SRD stationary Gaussian process with g > 1. Then 

PmUB h (±e) - B h (E* e )) = 0 P (hX lS 1/2 ). (40) 


With the choice h* x [n!/ logu/) 1 / 2 ^ where n' = \N t \, we have 
Pmax(-B/i(S e ) — S e )) = Op 


Q— 1 1 

log n' \ 2e ( n' \ 2e 

\ + 


n 


log n' 


siogp 

- -h sb n ) 


n 


(41) 


It is interesting to note that the price we pay to choose h for not knowing the error 


process is the second term in (41). Bandwidth selection for the smoothing parameter b n is 
a theoretically challenging task in the high dimension. Asymptotic optimal order for the 
parameter is available up to some unknown constants depending on the data generation 
parameters. We shall use the cross-validation (CV) in our simulation studies and real data 
analysis. 


Proof of Proposition 7. 5} Since we consider the uniform kernel, we may assume b n = 
1 , \N t \ — n and then rescale. Observe that 


max | a 2 k - a* 2 k \ 

\k\<h 


= max — 
\k\<h n 


< max — 
\k\<h n 


n—k 




2=1 

n—k 


^ +/c) 

V 2 /n-k 


2=1 

n—k 


n—k 


^ ^ ^ 2 +fc (^2 & i ) 

1/2 


2=1 


< max — ( 
\k\<h n ' 


E< 


< 


i= 1 


max — 
|fc|<h n 


E 4 


*n—k 

E 

v i =1 
1/2 


1^2+fc ^2+fcJ 

k 2=1 

1/2 /n _ k \ 1/2 

A „ \2 


'i -\-k 


E< 


. 2=1 


1/2' 


n 


2=1 


n 


E e ? 

2=1 


1/2 


u 


^ / o e i) 


2=1 
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By Lemma 7.3 


1 

- - e-i ) 2 = \£ t - £ t \ 2 2 = \X t [(3(t) - P(t)\\% = 0 P (A 2 s). 


Then, it follows from the last expression and n 1 Y^=i e l = Op(l) that 


max \a 2 k - cr* 2 k \ = Op(AiS 1/2 ). 
\k\<h 1 e,fe e ’ fel v ' 


Therefore 


Pmax(5 h (S e ) - B h {T£ e )) < h max |d 2 fc - a. 

\k\<h 


*2 | 
e,k I 


0 P (/rAis 1/2 ). 


□ 
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